Abstract

Similarity between objects is multi-faceted, and it can be easier for human annotators to measure it when the focus is on a specific aspect. We consider the problem of mapping objects into view-specific embeddings where the distance between them is consistent with similarity comparisons of the form “from the t-th view, object A is more similar to B than to C”. Our framework jointly learns view-specific embeddings, exploiting correlations between views. Experiments on a number of datasets, including one of multi-view crowdsourced comparisons on bird images, show that the proposed method achieves lower triplet generalization error than both learning embeddings independently for each view and pooling all views into one view. Our method can also be used to learn multiple measures of similarity over input features taking class labels into account, and it compares favorably to existing approaches for multi-task metric learning on the ISOLET dataset.

Introduction 

Measures of similarity play an important role in applications such as content-based recommendation, image search, and speech recognition. Therefore, a number of techniques to learn a measure of similarity from data have been proposed (Xing et al. 2002; Davis et al. 2007; Weinberger, Blitzer, and Saul 2006; McFee and Lanckriet 2011). When the measure of distance is induced by an inner product in a low-dimensional space, as is done in many studies, learning a distance metric is equivalent to learning an embedding of objects in a low-dimensional space. This is useful for visualization as well as for using the learned representation in a variety of downstream tasks that require fixed-length representations of objects, as has been demonstrated by the applications of word embeddings (Mikolov et al. 2013) in language.

Among various forms of supervision for learning a distance metric, similarity comparisons of the form “object A is more similar to B than to C”, which we call triplet comparisons, are extremely useful for obtaining an embedding that reflects perceptual similarity (Agarwal et al. 2007; Tamuz et al. 2011; van der Maaten and Weinberger 2012). Triplet comparisons can be obtained by crowdsourcing, or they may be derived from class labels if available.

The task of judging similarity comparisons, however, can be challenging for human annotators. Consider the problem of comparing three birds as seen in Fig. 1. Most annotators will say that the head of bird A is more similar to the head of B while the back of A is more similar to C. Such ambiguity leads to noise in annotation and results in poor embeddings.

Figure 1: Ambiguity in similarity. Depending on whether we focus on the back (middle row) or on the head (bottom row), bird A may appear more similar to B or C.

A better approach would be to tell the annotator the desired view or the perspective of the object to use for measuring similarity. Such view-specific comparisons are not only easier for annotators, but they can also enable precise feedback for human “in the loop” tasks, such as interactive fine-grained recognition (Wah, Maji, and Belongie 2015), thereby reducing the human effort. The main drawback of learning view-specific embeddings independently is that the number of similarity comparisons scales linearly with the number of views. This is undesirable as even learning a single embedding of $N$ objects may require $O(N^3)$ triplet comparisons (Jamieson and Nowak 2011) in the worst case.

We propose a method for learning embeddings jointly that addresses this drawback. Our method exploits underlying correlations that may exist between the views, allowing a better use of the training data. It models the correlation between views by assuming that each view is a low-rank projection of a common embedding. Our model can be seen as a matrix factorization model in which the local metric is defined as $L M_t L^\top$, where $L$ is a matrix that parametrizes the common embedding and $M_t$ is a positive semidefinite matrix parametrizing the individual view. The model can be efficiently trained by alternately updating the view-specific metrics and the common embedding.

We experiment with a synthetic dataset and two realistic datasets, namely, poses of airplanes and crowd-sourced similarities collected on different body parts of birds (CUB dataset; Welinder et al. 2010). On most datasets our joint learning approach obtains lower triplet generalization error compared to the independent learning approach or naively pooling all the views into a single one, especially when the number of training triplets is limited. Furthermore, we apply our joint metric learning approach to the multi-task metric learning setting studied by Parameswaran and Weinberger (2010) to demonstrate that our method can also take input features and class labels into account. Our method compares favorably to the previous method on the ISOLET dataset.


Formulation

In this section, we first review the single view metric learning problem considered in previous work. Then we extend it to the case where there are multiple measures of similarity.

Metric learning from triplet comparisons

Given a set of triplets $S = \{(i,j,k) \mid \text{object } i \text{ is more similar to object } j \text{ than to object } k\}$ and possibly input features $x_1, \ldots, x_N \in \mathbb{R}^H$, we aim to find a positive semidefinite matrix $M \in \mathbb{R}^{H \times H}$ such that the pairwise comparison of the distances induced by the inner product $\langle x, y \rangle_M = x^\top M y$ parametrized by $M$ (approximately) agrees with $S$, i.e., $(i,j,k) \in S \Rightarrow \|x_i - x_j\|_M^2 < \|x_i - x_k\|_M^2$. If no input feature is given, we take $x_i$ as the $i$th coordinate vector in $\mathbb{R}^N$, and learning $M$, which would become $N \times N$, would correspond to finding embeddings of the $N$ objects in a Euclidean space with dimension equal to the rank of $M$.

Mathematically the problem can be expressed as follows:

$$\min_{M \succeq 0} \sum_{(i,j,k) \in S} \ell\left(\|x_i - x_j\|_M^2,\ \|x_i - x_k\|_M^2\right) + \gamma \operatorname{tr}(M), \qquad (1)$$

where $\|x - y\|_M^2 = (x - y)^\top M (x - y)$; the loss function $\ell$ can be, for example, logistic (Cox et al. 2000), or hinge, $\ell(d_{i,j}, d_{i,k}) = \max(1 + d_{i,j} - d_{i,k}, 0)$ (Agarwal et al. 2007; Weinberger, Blitzer, and Saul 2006; Chechik et al. 2010). Other choices of loss functions lead to crowd kernel learning (Tamuz et al. 2011) and t-distributed stochastic triplet embedding (t-STE) (van der Maaten and Weinberger 2012). Penalizing the trace of the matrix $M$ can be seen as a convex surrogate for penalizing the rank (Agarwal et al. 2007; Fazel, Hindi, and Boyd 2001). $\gamma > 0$ is a regularization parameter.

After the optimal $M$ is obtained, we can find a low-rank factorization of $M$ as $M = L L^\top$ with $L \in \mathbb{R}^{H \times D}$. This is particularly useful when no input feature is provided, because each row of $L$, which is $N \times D$ in this case, corresponds to a $D$ dimensional embedding of each object.

Jointly learning multiple metrics

Now let us assume that $T$ sets of triplets $S_1, \ldots, S_T$ are available. These can be obtained by asking annotators to focus on a specific aspect when making pairwise comparisons, as in human in the loop tasks (Wah et al. 2014; Wah, Maji, and Belongie 2015). Alternatively, different measures of similarity can come from considering multiple related metric learning problems as in (Parameswaran and Weinberger 2010; Rai, Lian, and Carin 2014).

While a simple approach to handle multiple similarities would be to parametrize each aspect or view by a positive semidefinite matrix $M_t$, this would not induce any shared structure among the views. Our goal is to learn a global transformation $L$ that maps the objects into a common $D$ dimensional space as well as local view-specific metrics $M_t$ ($t = 1, \ldots, T$).

To this end, we formulate the learning problem as follows:

$$\min_{L,\ M_1, \ldots, M_T \succeq 0} \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \ell_{i,j,k}(L, M_t) + \gamma \sum_{t=1}^{T} \operatorname{tr}(M_t) + \beta \|L\|_F^2, \qquad (2)$$

where $\ell_{i,j,k}(L, M_t) := \ell\left(\|L^\top (x_i - x_j)\|_{M_t}^2,\ \|L^\top (x_i - x_k)\|_{M_t}^2\right)$, and $\ell$ is a loss function as above. We use the hinge loss in the experiments in this paper, but the proposed framework readily generalizes to other loss functions proposed in the literature (Tamuz et al. 2011; van der Maaten and Weinberger 2012). Note again that when no input feature is provided, the global transformation matrix $L$ becomes an $N \times D$ matrix that consists of $D$ dimensional embeddings of the objects.

Intuitively the global transformation $L$ plays the role of a bottleneck and forces the local metrics to share the common $D$ dimensional subspace because they are restricted to the form $L M_t L^\top$.
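To make the role of the shared bottleneck concrete, the following minimal sketch (an illustration, not the authors' code) computes a view-specific embedding in the no-feature case, where $x_i$ is the $i$th coordinate vector and $L^\top x_i$ is simply row $i$ of $L$. It relies on the identity $\|L^\top(x_i - x_j)\|_{M_t}^2 = \|M_t^{1/2} L^\top (x_i - x_j)\|^2$, so the rows of $L M_t^{1/2}$ act as the embedding for view $t$. The function names `psd_sqrt`, `view_embedding`, and `view_distance` are hypothetical.

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def view_embedding(L, M_t):
    """Rows are y_i = M_t^{1/2} L^T x_i for coordinate vectors x_i (no-feature case)."""
    return L @ psd_sqrt(M_t)

def view_distance(L, M_t, i, j):
    """Squared distance ||L^T (x_i - x_j)||^2 measured in the local metric M_t."""
    z = L[i] - L[j]
    return float(z @ M_t @ z)

# Toy example with made-up sizes: 6 objects, common dimension D = 3.
rng = np.random.default_rng(0)
L = rng.standard_normal((6, 3))
A = rng.standard_normal((3, 3))
M_t = A @ A.T                      # an arbitrary PSD local metric
Y_t = view_embedding(L, M_t)       # view-specific coordinates
```

Euclidean distances between rows of `Y_t` then agree with the view-specific metric, which is why a low-rank $M_t$ yields a low-dimensional embedding for that view.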

The proposed model (2) includes various simpler models as special cases. First, if $L$ is an $H \times H$ identity matrix, there is no sharing across different views and indeed the objective function will decompose into a sum of view-wise objectives; we call this independent learning. On the other hand, if we constrain all $M_t$ to be equal, the same metric will apply to all the views and the learned metric will be essentially the same as learning a single shared metric as in Eq. (1) with $S = \bigcup_{t=1}^{T} S_t$; we call this pooled learning.

We employ regularization terms for both the local metrics $M_t$ and the global transformation matrix $L$ in (2). The trace penalties $\operatorname{tr}(M_t)$ are employed to obtain low-rank matrices $M_t$ as above. The regularization term on the norm of $L$ is necessary to resolve the scale ambiguity. Although the above formulation has two hyperparameters $\beta$ and $\gamma$, we show below in Proposition 1 that the product $\beta\gamma$ is the only hyperparameter that needs to be tuned.

To minimize the objective (2), we update the $M_t$'s and $L$ alternately. Both updates are (sub)gradient descent. The $M_t$ update is followed by a projection onto the positive semidefinite (PSD) cone. Note that if we choose a convex loss function, e.g., the hinge loss, then the problem is convex with respect to the $M_t$'s, and the $M_t$'s can be optimized independently since they appear in disjoint terms. The algorithm is summarized in Algorithm 1.

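Effective regularization term The sum of the two regularization terms employed in (2) can be reduced to a single effective regularization term with only one hyperparameter, as we show in the following proposition (we give the proof in the supplementary material).

Proposition 1.

$$\min_{L,\ M_1, \ldots, M_T} \left( \gamma \sum_{t=1}^{T} \operatorname{tr}(M_t) + \beta \|L\|_F^2 \right) = 2\sqrt{\beta\gamma}\ \operatorname{tr}\left( \Big( \sum_{t=1}^{T} K_t \Big)^{1/2} \right) \quad \text{s.t. } L M_t L^\top = K_t \ (\forall t),$$

where the power 1/2 in the r.h.s. is the matrix square root.

As a corollary, we can always reduce or maintain the regularization terms in (2) without affecting the loss term by the rescaling $M_t \leftarrow M_t / \alpha^2$ and $L \leftarrow \alpha L$ with $\alpha = \left( \gamma \sum_{t=1}^{T} \operatorname{tr}(M_t) / (\beta \|L\|_F^2) \right)^{1/4}$.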
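The rescaling in the corollary can be checked numerically. The sketch below is an illustration under assumed toy sizes and made-up values of $\beta$ and $\gamma$; `regularizer` and `rebalance` are hypothetical helper names. It applies $M_t \leftarrow M_t/\alpha^2$, $L \leftarrow \alpha L$ and leaves every product $L M_t L^\top$, and hence the loss term, unchanged.

```python
import numpy as np

def regularizer(L, Ms, beta, gamma):
    """gamma * sum_t tr(M_t) + beta * ||L||_F^2, the penalty used in (2)."""
    return gamma * sum(np.trace(M) for M in Ms) + beta * np.sum(L ** 2)

def rebalance(L, Ms, beta, gamma):
    """Rescale (L, M_t) as in the corollary; L M_t L^T is left unchanged."""
    tr_sum = sum(np.trace(M) for M in Ms)
    alpha = (gamma * tr_sum / (beta * np.sum(L ** 2))) ** 0.25
    return alpha * L, [M / alpha ** 2 for M in Ms]

# Toy values (assumptions for illustration only).
rng = np.random.default_rng(0)
L = rng.standard_normal((8, 3))
Ms = [A @ A.T for A in rng.standard_normal((2, 3, 3))]
beta, gamma = 0.5, 2.0

L2, Ms2 = rebalance(L, Ms, beta, gamma)
before = regularizer(L, Ms, beta, gamma)
after = regularizer(L2, Ms2, beta, gamma)
```

After rebalancing, the two penalty terms are equal and their sum is $2\sqrt{\beta\gamma \sum_t \operatorname{tr}(M_t)\,\|L\|_F^2}$ evaluated at the original $(L, M_t)$, which depends on the hyperparameters only through the product $\beta\gamma$.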

Number of parameters A simple parameter counting argument tells us that independently learning $T$ views requires fitting $O(DHT)$ parameters, where $H$ is the input dimension, which can be as large as $N$, $D$ is the embedding dimension, and $T$ is the number of views. On the other hand, our joint learning model has only $O(HD + D^2 T)$ parameters. Thus when $D \ll H$, our model has far fewer parameters and enables better generalization, especially when the number of triplets is limited.
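As a quick worked instance of this counting argument, the sketch below evaluates both leading-order counts for illustrative sizes ($H = 200$, $D = 10$, $T = 6$, chosen here as an example and not taken from a specific experiment). The counts treat each $M_t$ as a dense $D \times D$ matrix; exploiting symmetry would roughly halve the second term.

```python
def independent_params(H, D, T):
    """Leading-order count for T independent H x D embeddings."""
    return H * D * T

def joint_params(H, D, T):
    """Leading-order count for one shared H x D map plus T local D x D metrics."""
    return H * D + D * D * T

# Illustrative sizes (assumptions): 200 inputs, 10 dimensions, 6 views.
n_independent = independent_params(200, 10, 6)   # 200 * 10 * 6
n_joint = joint_params(200, 10, 6)               # 200 * 10 + 10 * 10 * 6
```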

Efficiency Reducing the dimension from $H$ to $D$ by the common transformation $L$ is also favorable in terms of computational efficiency. The projection of $M_t$ onto the cone of $D \times D$ PSD matrices is much more efficient when $D \ll H$ compared to independently learning $T$ views.

Learning embeddings from triplet comparisons

In this section, we demonstrate the statistical efficiency of 
our model in both the triplet embedding (no input feature), 
and multi-task metric learning scenarios (with features). 

Experimental setup 

On each dataset, we divided the triplets into training and test sets and measured the quality of embeddings by the triplet generalization error, i.e., the fraction of test triplets whose relations are incorrectly modelled by the learned embedding. The error was measured for each view and averaged across views. The numbers of training triplets were the same for all the views. The regularization parameter was tuned using 5-fold cross-validation on the training set with candidate values on a logarithmic grid of powers of 10. The hinge loss was used as the loss function. We used $m_{\max} = 20$ as the number of inner iterations in the experiments.
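The triplet generalization error described above can be sketched as follows. This is an illustrative implementation, not the authors' evaluation code; it assumes each view has an embedding matrix whose rows are object coordinates, and it counts ties as errors, which is an assumption the text does not specify. The names `triplet_error` and `average_error` are hypothetical.

```python
import numpy as np

def triplet_error(Y, triplets):
    """Fraction of triplets (i, j, k) for which object i is NOT closer to j than to k."""
    i, j, k = np.asarray(triplets).T
    d_ij = np.sum((Y[i] - Y[j]) ** 2, axis=1)
    d_ik = np.sum((Y[i] - Y[k]) ** 2, axis=1)
    return float(np.mean(d_ij >= d_ik))   # ties are counted as errors (assumption)

def average_error(embeddings, test_sets):
    """Per-view errors averaged across views."""
    return float(np.mean([triplet_error(Y, S) for Y, S in zip(embeddings, test_sets)]))

# Toy check: three points on a line at positions 0, 1 and 3.
Y = np.array([[0.0], [1.0], [3.0]])
test = [(0, 1, 2), (0, 2, 1)]          # first triplet holds, second is violated
err = triplet_error(Y, test)
```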

In addition, we inspected how the similarity knowledge on existing views could be transferred to a new view where the number of similarity comparisons is small. We did this by conducting an experiment in which we drew a small set of training triplets from one view but used large numbers of training triplets from the other views.

Figure 2: (Left) View-specific similarities between poses of planes were obtained by considering subsets of landmarks shown by different colored rectangles and measuring their similarity in configuration up to a scaling and translation. (Right) Perceptual similarities between bird species were collected by showing users either the full image (view 1), or crops around various parts (views 2, 3, 4, 5, 6). The average image for each view is also shown.

Algorithm 1: Multiple-metric Learning

Input: the number of objects $N$ (or input features $x_1, \ldots, x_N$); dimension of embedding $D$; triplet constraints $S_t$, $t = 1, \ldots, T$; regularization parameters $\beta$, $\gamma$; the number of inner gradient updates $m_{\max}$
Output: Global transformation $L$; PSD matrices $M_1, \ldots, M_T$
Initialize $L$ randomly; initialize $M_t$ as identity matrices;
while not converged do
    Update $L$ using step-size $\eta = \eta_0/\sqrt{m}$ for $m_{\max}$ times as
        $L \leftarrow L - \eta \left( \sum_{t=1}^{T} \sum_{(i,j,k) \in S_t} \nabla_L \ell_{i,j,k}(L, M_t) + 2\beta L \right)$
    for $t \in \{1, 2, \ldots, T\}$ do
        Update $M_t$ using step-size $\eta = \eta_0/\sqrt{m}$ for $m_{\max}$ times by taking a gradient step
            $M_t \leftarrow M_t - \eta \left( \sum_{(i,j,k) \in S_t} \nabla_{M_t} \ell_{i,j,k}(L, M_t) + \gamma I_D \right)$
        and projecting $M_t$ to the PSD cone;
    end
end
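The alternating scheme of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it covers only the no-feature case (so $L^\top x_i$ is row $i$ of $L$), uses the hinge loss, replaces the convergence test with a fixed number of outer iterations `n_outer`, and takes `m` to be the inner iteration counter in the step size $\eta_0/\sqrt{m}$. All function names and default values are hypothetical.

```python
import numpy as np

def project_psd(M):
    """Project a symmetric matrix onto the PSD cone by clipping eigenvalues."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)
    return (V * np.clip(w, 0.0, None)) @ V.T

def active_diffs(L, M, S):
    """Triplets with a positive hinge term, their differences z_ij, z_ik, and hinge values."""
    i, j, k = S.T
    zij, zik = L[i] - L[j], L[i] - L[k]
    margin = (1.0 + np.einsum('nd,de,ne->n', zij, M, zij)
                  - np.einsum('nd,de,ne->n', zik, M, zik))
    act = margin > 0
    return S[act], zij[act], zik[act], margin[act]

def objective(L, Ms, sets, beta, gamma):
    """Objective (2) with the hinge loss."""
    loss = sum(active_diffs(L, M, S)[3].sum() + gamma * np.trace(M)
               for M, S in zip(Ms, sets))
    return float(loss + beta * np.sum(L ** 2))

def grad_L(L, Ms, sets, beta):
    """Subgradient of (2) with respect to L."""
    G = 2.0 * beta * L
    for M, S in zip(Ms, sets):
        Sa, zij, zik, _ = active_diffs(L, M, S)
        gij, gik = 2.0 * zij @ M, 2.0 * zik @ M
        np.add.at(G, Sa[:, 0], gij - gik)   # rows of the anchor objects i
        np.add.at(G, Sa[:, 1], -gij)        # rows of the similar objects j
        np.add.at(G, Sa[:, 2], gik)         # rows of the dissimilar objects k
    return G

def grad_M(L, M, S, gamma):
    """Subgradient of (2) with respect to one local metric M_t."""
    _, zij, zik, _ = active_diffs(L, M, S)
    return zij.T @ zij - zik.T @ zik + gamma * np.eye(M.shape[0])

def fit(n_objects, sets, dim, beta=0.1, gamma=0.1, eta0=1e-3,
        m_max=20, n_outer=10, seed=0):
    rng = np.random.default_rng(seed)
    L = 0.1 * rng.standard_normal((n_objects, dim))   # random initialization
    Ms = [np.eye(dim) for _ in sets]                  # identity initialization
    for _ in range(n_outer):                          # stands in for "while not converged"
        for m in range(1, m_max + 1):
            L = L - eta0 / np.sqrt(m) * grad_L(L, Ms, sets, beta)
        for t, S in enumerate(sets):
            for m in range(1, m_max + 1):
                step = eta0 / np.sqrt(m)
                Ms[t] = project_psd(Ms[t] - step * grad_M(L, Ms[t], S, gamma))
    return L, Ms
```

Each element of `sets` is an integer array of shape (number of triplets, 3) holding the triplets $(i, j, k)$ of one view. The losses are summed rather than averaged over triplets, so the step size `eta0` would likely need tuning for larger triplet sets.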

We compared our method with the following two baselines. Independent: We conducted triplet embedding on each view, treating each of them independently. We parametrized $M = L L^\top$ with $L \in \mathbb{R}^{N \times D}$ and minimized (1) using the software provided by van der Maaten and Weinberger (2012). Pooled: We learned a single embedding with the training triplets from all the views combined.

Synthetic data 

Description Two synthetic datasets were generated. One consisted of 200 points uniformly sampled from a 10 dimensional unit hypercube, while the other dataset had 200 objects from a mixture of four Gaussians with variance 1 whose centers were randomly chosen in a hypercube with side length 10. Six views were generated on each dataset. Each view was produced by projecting data points onto a random subspace. The dimensions of the six random subspaces were 2, 3, 4, 5, 6, and 7, respectively.

Figure 3: Triplet generalization errors averaged across views for various datasets. (a) Clustered. (b) Uniform. (c) Poses of planes.

Results Embeddings were learned with embedding dimensions $D = 5$ and $10$. Triplet generalization errors are plotted in Fig. 3 (a) and (b) for clustered and uniform data, respectively. Our algorithm achieved lower triplet generalization error than both independent and pooled methods on both datasets. The improvement was particularly large when the number of triplets was limited (less than 10,000 for the clustered case). The simple pooled method was the worst on both datasets. Note that in contrast to the pooled method, the proposed joint method can choose a different embedding dimension automatically (due to the trace regularization) for each view while maintaining a shared subspace.

Poses of airplanes 

Description This dataset was constructed from 200 images of airplanes from the PASCAL VOC dataset (Everingham et al. 2010), which were annotated with 16 landmarks such as nose tip, wing tips, etc. (Bourdev et al. 2010). We used these landmarks to construct a pose-based similarity. Given two planes and the positions of landmarks in these images, pose similarity was defined as the residual error of alignment between the two sets of landmarks under scaling and translation. We generated 5 views, each of which was associated with a subset of these landmarks; see supplementary material for details. Three annotated images from the set are shown in the left panel of Fig. 2. The planes are highly diverse, ranging from passenger planes to fighter jets and varying in size and form, which results in a slightly different similarity between instances for each view. However, there is a strong correlation between the views because the underlying set of landmarks is shared.

Results We used $D = 3$ and $D = 10$ as embedding dimensions. Figure 3(c) shows the triplet generalization errors of the three methods. The proposed joint model performed clearly better than independent learning, not only on average but also uniformly for each view (see supplementary material). The pooled method had a slightly larger error than the proposed joint learning approach but a smaller error than the independent approach.

CUB-200 birds data 

Description We used the dataset (Welinder et al. 2010) consisting of 200 species of birds and the annotations collected using the setup of Wah et al. (2014; 2015). Briefly, similarity triplets among images of each species were collected in a crowd-sourced manner: every time, a user was asked to judge the similarity between an image of a bird from the target species $z_i$ and nine images of birds of different species $\{z_k\}_{k \in K}$, $K \subset \mathcal{K}$, using the interface of Wilber et al. (2014), where $\mathcal{K}$ is the set of all 200 species. For each display, the user partitioned these nine images into two sets, $K_{\mathrm{sim}}$ and $K_{\mathrm{dissim}}$, with $K_{\mathrm{sim}}$ containing birds considered similar to the target and $K_{\mathrm{dissim}}$ having the ones considered dissimilar. Such a partition was expanded into an equivalent set of triplet constraints on the associated species, $\{(i,j,l) \mid j \in K_{\mathrm{sim}},\ l \in K_{\mathrm{dissim}}\}$. Therefore, for each user response, $|K_{\mathrm{sim}}|\,|K_{\mathrm{dissim}}|$ triplet constraints were obtained.
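The expansion of one user response into triplet constraints can be sketched as below. This is an illustration of the set construction stated above; the function name and the species indices in the example are made up.

```python
from itertools import product

def triplets_from_partition(target, similar, dissimilar):
    """All (i, j, l) with j in the similar set and l in the dissimilar set."""
    return [(target, j, l) for j, l in product(similar, dissimilar)]

# Hypothetical response: target species 7, nine displayed species split 4 / 5.
k_sim = [12, 33, 58, 101]
k_dissim = [3, 19, 64, 150, 187]
triplets = triplets_from_partition(7, k_sim, k_dissim)
```

Each response therefore contributes as many triplets as the product of the two set sizes, so a single display of nine images yields at most 20 constraints (a 4/5 split).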

To collect view-specific triplets, we presented 5 different cropped versions (e.g. beak, breast, wing) of the bird images as shown in the right panel of Fig. 2 and used the same procedure as before to collect triplet comparisons. We obtained about 100,000 triplets from the uncropped original images and about 4,000 to 7,000 triplets from each of the 5 cropped views. This dataset reflects a more realistic situation where not all triplet relations are available and some of them may be noisy due to the nature of crowd-sourcing.

In addition to the triplet generalization error, we evaluated the embeddings in a classification task using a biological taxonomy of the bird species. Note that in Wah et al. (2015) embeddings were used to interactively categorize images; here we simplify this process to enable detailed comparisons. We manually grouped the 200 classes into 6 super classes so that the number of objects in each class was balanced. These class labels were not used in the training but allowed us to evaluate the quality of embeddings using the leave-one-out (LOO) classification error. More precisely, at the test stage, we predict the class label of each embedded point according to the labels of its 3 nearest neighbours (3-NN) in the learned metric.
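The leave-one-out 3-NN evaluation can be sketched as follows. This is a minimal illustration, assuming Euclidean distance in the view-specific embedding and majority voting with ties broken in favour of the smallest label; the tie-breaking rule is an assumption, since the text does not specify one. The function name is hypothetical.

```python
import numpy as np

def loo_knn_error(Y, labels, k=3):
    """Leave-one-out k-NN error: each point is classified by its k nearest other points."""
    Y = np.asarray(Y, dtype=float)
    labels = np.asarray(labels)
    sq = np.sum(Y ** 2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * Y @ Y.T   # pairwise squared distances
    np.fill_diagonal(dist, np.inf)                     # exclude the point itself
    neighbours = np.argsort(dist, axis=1)[:, :k]
    errors = 0
    for idx, neigh in enumerate(neighbours):
        values, counts = np.unique(labels[neigh], return_counts=True)
        predicted = values[np.argmax(counts)]          # majority vote
        errors += int(predicted != labels[idx])
    return errors / len(labels)

# Toy example: two well-separated groups on a line.
Y = np.array([[0.0], [0.1], [0.2], [0.3], [10.0], [10.1], [10.2], [10.3]])
clean = loo_knn_error(Y, [0, 0, 0, 0, 1, 1, 1, 1])
noisy = loo_knn_error(Y, [0, 0, 0, 1, 1, 1, 1, 1])    # one point carries the wrong label
```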

Finally, since more triplets were available from the first 
(uncropped) view compared to other views, we first sampled 
equal numbers of triplets in each view up to a total of 18,000 
triplets. Afterwards, we added triplets only to the first view. 

Results We used $D = 10$ and $D = 60$ as embedding dimensions; note that joint learning in 60 dimensions has roughly the same number of parameters as independent learning in 10 dimensions. Figures 4 (a) and (b) show the triplet generalization errors and the LOO 3-NN classification errors, respectively. The solid vertical line shows the point (18,000 triplets) at which we start to add training triplets only to the first view. Comparing joint learning in 10 dimensions and 60 dimensions, we see that the higher dimension gives the lower error. The error of joint learning was lower than that of independent learning for small numbers of triplets. Interestingly, the error of joint learning in 60 dimensions coincides with that of independent learning in 10 dimensions after seeing 6,000 triplets. This can be explained by the fact that with 6 views, the two models have comparable complexity (see discussion at the end of the previous section) and thus the same asymptotic variance. Our method obtains lower leave-one-out classification errors on all views except for the first view; see supplementary material.

Figure 4: Results on CUB-200 birds dataset. (a) Triplet generalization error. (b) LOO 3-NN classification error. (c) Learning a new view; this panel shows the triplet generalization error on the second view.

Learning a new view On the CUB-200 birds dataset, we simulated the situation of learning a new view (or zero-shot learning). We drew a training set that contains 100 to 1,000 triplets from the second view and 3,000 triplets from each of the other 5 views. We investigated how joint learning helps in estimating a good embedding on a new view with an extremely small number of triplets. The triplet generalization errors of both approaches are shown in Fig. 4(c). The triplet generalization error of the proposed joint learning was lower than that of the independent learning up to around 700 triplets. The embedding of the second view learned jointly with other views was clearly better than that learned independently and consistent with the quantitative evaluation; see supplementary material.

Performance gain and triplet consistency 

In Fig. 5, we relate the performance gain we obtained for the joint/pooled learning approaches compared to the independent learning approach with the underlying between-task similarity. The performance gain was measured by the difference between the areas under the triplet generalization error curves, normalized by that of the independent learning. The between-task similarity was measured by the triplet consistency between two views averaged over all pairs of views. For the CUB-200 dataset, in which only a subset of valid triplet constraints is available, we take the independently learned embeddings with the largest number of triplets and use those to compute the triplet consistency.

We can see that when the triplet consistency is very high, the pooled learning approach is good enough. However, when the triplet consistency is not too high, pooling the triplets together may hurt. The proposed joint learning approach has the most advantage in this intermediate regime. On the other hand, the consistency was close to random (0.5) for the CUB-200 dataset, possibly explaining why the performance gain was not as significant as in the other datasets.

Figure 5: Relating the performance gains of joint and pooled learning with the triplet consistency.

Incorporating features and class information

The proposed method can be applied to a more general setting in which each object comes with a feature vector and a loss function not derived from triplet constraints is used.

As an example, we employ the idea of the multi-task large margin nearest neighbor (MT-LMNN) algorithm (Parameswaran and Weinberger 2010) and adapt our model to handle a classification problem. The loss function of MT-LMNN consists of two terms. The first term is a hinge loss for triplet constraints as in (2), but the triplets are derived from class labels. The second term is the sum of squared distances between each object and its “target neighbors”, which are also defined based on class labels; see Weinberger et al. (2006; 2010) for details. The major difference between MT-LMNN and our model is that MT-LMNN parametrizes a local metric as the sum of a global average $M_0$ and a view-specific metric $M_t$ as $K_t = M_0 + M_t$; thus the learned metric is generally full rank. On the other hand, our method parametrizes it as a product of global transform and local metric as $K_t = L M_t L^\top$, which allows the local embedding dimension to be controlled by the trace regularization.

We conduct experiments on the ISOLET spoken alphabet recognition dataset (Fanty and Cole 1991), which consists of 7797 examples of English letters spoken by 150 subjects; each example is described by a 617 dimensional feature vector. The task is to recognize the letter of each spoken example as one of the letters of the English alphabet. The subjects are grouped into sets of 30 similar speakers, leading to 5 tasks.

We adapt the experimental setting from the work of MT-LMNN. Data is first projected onto its first 378 leading PCA components, which capture 99% of the variance. We train our model in an $H = 378$ dimensional space with $D = 169$ and $378$, and compare it with MT-LMNN trained with the code provided by the authors. In the experiment, each task is randomly divided into 60/20/20 subsets for train/validation/test. We tuned the parameters on the validation sets.

Table 1: Test error rates (%) on ISOLET dataset.

        Tested with view-specific train data      Tested with all train data
Task    MT-LMNN    Proposed   Proposed            MT-LMNN    Proposed   Proposed
        378 dim    D = 169    D = 378             378 dim    D = 169    D = 378
1       4.68       3.78       4.10                4.23       3.46       3.65
2       4.55       3.91       3.52                3.14       3.84       3.40
3       6.28       5.32       5.64                3.52       3.39       3.52
4       7.76       5.83       5.83                4.23       4.10       3.52
5       6.28       5.06       5.19                4.23       3.97       3.97
Avg     5.91       4.78       4.86                3.87       3.76       3.61

Test error rates of 3-nearest-neighbor (3-NN) classifiers are reported in Table 1. The left panel shows the errors using only the view-specific training data for the classification. The right panel shows those using all the training data with the view-specific distance. Results are averaged over 10 runs. Simpler baseline methods, such as the Euclidean metric and pooled (single task) learning, are not included here because MT-LMNN already performed better than them. We can see that the proposed method performed better than MT-LMNN on average, while learning in a 378 dimensional space and reducing to a 169 dimensional space led to comparable error rates. A possible explanation for this mild dependence on the choice of embedding dimension $D$ could be given by the fact that both $L$ and $M_t$ are regularized and the effective embedding dimension is determined by the regularization and not by the choice of $D$; see Prop. 1. The averaged error rates reported in the original paper using 169 PCA dimensions were 5.19% for the view-specific case and 4.01% when all training data were used; our numbers are still better than theirs.

Related work 

Embedding of objects from triplet or paired distance comparisons goes back to the work of Shepard (1962a; 1962b) and Kruskal (1964a; 1964b) and has been studied extensively in recent years (Agarwal et al. 2007; Tamuz et al. 2011; McFee and Lanckriet 2009; 2011; van der Maaten and Weinberger 2012).

More recently, triplet embedding / metric learning problems that involve multiple measures of similarity have been considered. Parameswaran and Weinberger (2010) aimed at jointly solving multiple related metric learning problems by exploiting possible similarities. More specifically, they modeled the inner product in each view by a sum of a shared global matrix and a view-specific local matrix. Moreover, Rai, Lian, and Carin (2014) proposed a Bayesian approach to multi-task metric learning. Unfortunately, the sum structure in their work typically does not produce a low-rank metric, which makes it unsuitable for learning view-specific embeddings. In contrast, our method models the local metric as a product of the global and local factors, allowing the trace norm regularizer to determine the rank of each local metric. Xie and Xing (2013) and Yu, Wang, and Tao (2012) studied metric learning problems with multiple input views. This is different from our setting, in which the notion of similarity varies from view to view. Amid and Ukkonen (2015) considered the task of multi-view triplet embedding in which the view is a latent variable; they proposed a greedy algorithm for finding the view membership of each object as well as its embedding. It could be useful to combine this approach with ours when we do not have enough resources to collect triplets from all possible views.

Discussion 

We have proposed a model for jointly learning multiple measures of similarity from multi-view triplet observations. The proposed model consists of a global transformation, which represents each object as a fixed dimensional vector, and local view-specific metrics.

Experiments on both synthetic and real datasets have demonstrated that our proposed joint model outperforms independent and pooled learning approaches in most cases. Additionally, we have shown that the advantage of our joint learning approach becomes most prominent when the views are similar but not too similar (which can be measured by triplet consistency). Moreover, we have extended our model to incorporate class labels and feature vectors. The proposed model performed favorably compared to MT-LMNN on the ISOLET dataset. Since in many real applications similarity triplets can be expensive to obtain, jointly learning similarity metrics is preferable as it can recover the underlying structure using a relatively small number of triplets.

One way to look at the proposed model is to view the shared global transformation as controlling the complexity. However, our experiments have shown that generally the higher the dimension, the better the performance (except for the ISOLET dataset tested with view-specific training data). Thus an alternative explanation could be that the regularization on both the global transformation $L$ and the local metrics $M_t$ is implicitly controlling the embedding dimension.

Future work includes extension of the current model to other loss functions (e.g., the t-STE loss (van der Maaten and Weinberger 2012)) and to the setting in which we do not know which view each triplet came from.

Supplementary Material


Proof of Proposition 1 


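We repeat the statement for convenience.

Proposition 1.

$$\min_{L,\ M_1, \ldots, M_T} \left\{ \gamma \sum_{t=1}^{T} \operatorname{tr}(M_t) + \beta \|L\|_F^2 \ :\ L M_t L^\top = K_t \ (\forall t) \right\} = 2\sqrt{\beta\gamma}\ \operatorname{tr}\left( \Big( \sum_{t=1}^{T} K_t \Big)^{1/2} \right).$$

Here the power 1/2 in the right-hand side is the matrix square root.

Proof. Let us define $M = \sum_{t=1}^{T} M_t$. For any decomposition $K_t = L M_t L^\top$, we have

$$\begin{aligned}
2\sqrt{\beta\gamma}\ \operatorname{tr}\left( \Big( \sum_{t=1}^{T} K_t \Big)^{1/2} \right) &= 2\sqrt{\beta\gamma}\ \operatorname{tr}\left( \big( L M L^\top \big)^{1/2} \right) \\
&= 2\sqrt{\beta\gamma}\ \| M^{1/2} L^\top \|_* \\
&= 2\sqrt{\beta\gamma} \sum_{i=1}^{D} \sigma_i\big( M^{1/2} L^\top \big) \\
&\le 2\sqrt{\beta\gamma} \sum_{i=1}^{D} \sigma_i\big( M^{1/2} \big)\, \sigma_i(L) \\
&\le \sum_{i=1}^{D} \left( \gamma\, \sigma_i^2\big( M^{1/2} \big) + \beta\, \sigma_i^2(L) \right) \\
&= \gamma \operatorname{tr}(M) + \beta \|L\|_F^2 \\
&= \gamma \sum_{t=1}^{T} \operatorname{tr}(M_t) + \beta \|L\|_F^2,
\end{aligned}$$

where $\|\cdot\|_*$ is the nuclear norm (Fazel, Hindi, and Boyd 2001) and $\sigma_i(\cdot)$ denotes the $i$th largest singular value; the fourth line follows from Theorem 3.3.14 (a) in Horn and Johnson (1991), and the fifth line is due to the arithmetic mean-geometric mean inequality.

Let $K := \sum_{t=1}^{T} K_t$ and let $K = U \Lambda U^\top$ be its eigenvalue decomposition. The equality is achieved by choosing

$$L = U \Lambda^{1/4} (\gamma/\beta)^{1/4}, \qquad (3)$$

$$M_t = \Lambda^{-1/4} U^\top K_t U \Lambda^{-1/4} (\beta/\gamma)^{1/2}. \qquad (4)$$

Note that even when $K$ is singular, each $K_t$ lies in the span of $K$, and by restricting to the subspace spanned by $K$, the above discussion is still valid. □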
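The choice (3) and (4) can be verified numerically. The sketch below is a sanity check under assumed toy sizes and arbitrary values of $\beta$ and $\gamma$, with randomly generated PSD matrices $K_t$ whose sum is nonsingular; it is not part of the original proof.

```python
import numpy as np

# Toy setup (assumptions): T = 3 random PSD matrices of size 4 x 4.
rng = np.random.default_rng(0)
D, T, beta, gamma = 4, 3, 0.7, 1.3
Ks = [A @ A.T for A in rng.standard_normal((T, D, D))]

K = sum(Ks)
lam, U = np.linalg.eigh(K)                        # K = U diag(lam) U^T

# Eq. (3): L = U Lambda^{1/4} (gamma/beta)^{1/4}
L = U * lam ** 0.25 * (gamma / beta) ** 0.25
# Eq. (4): M_t = Lambda^{-1/4} U^T K_t U Lambda^{-1/4} (beta/gamma)^{1/2}
W = U * lam ** -0.25
Ms = [W.T @ Kt @ W * (beta / gamma) ** 0.5 for Kt in Ks]

penalty = gamma * sum(np.trace(M) for M in Ms) + beta * np.sum(L ** 2)
bound = 2.0 * np.sqrt(beta * gamma) * np.sum(np.sqrt(lam))   # 2 sqrt(beta gamma) tr(K^{1/2})
```

With these choices every constraint $L M_t L^\top = K_t$ holds and `penalty` matches `bound` up to rounding error.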


This result can be understood analogously to the identity regarding the nuclear norm (Srebro, Rennie, and Jaakkola 2005)

$$\|X\|_* = \min_{U, V} \frac{1}{2}\left( \|U\|_F^2 + \|V\|_F^2 \right) \quad \text{subject to } X = U V^\top.$$

Note that the fact that the ratio of the two hyperparameters $\beta/\gamma$ can be absorbed in the scale ambiguity between $L$ and $M_t$, as in (3) and (4), is special to multiplicative models like our model and the nuclear norm, and would not hold for an additive model like MT-LMNN.

Figure 6: Triplet generalization errors. The small figures show errors on individual views and the large figures show the average. (Left) Clustered synthetic data. (Right) Uniformly distributed data.

Figure 7: Leave-one-out 3-nearest-neighbour classification errors on clustered synthetic data. The small figures show errors on individual views and the large figures show the average.


Additional details and results 

Synthetic dataset 

In addition to the results in the main paper, we illustrate the view-specific triplet generalization error in Figure 6 and the leave-one-out classification error for clustered synthetic data in Figure 7.
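For readers who want to reproduce this kind of evaluation, the sketch below computes a leave-one-out k-nearest-neighbour classification error on a learned embedding. The function name `loo_knn_error`, the use of Euclidean distance, and the tie-breaking rule (the smallest label wins when votes are tied) are assumptions made for illustration; the exact protocol used in the experiments may differ.

```python
import numpy as np

def loo_knn_error(X, y, k=3):
    """Leave-one-out k-NN error of embedding X (n x d) with class labels y (n,)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(D, np.inf)  # a point may not vote for itself
    nn = np.argsort(D, axis=1)[:, :k]
    errors = 0
    for i in range(len(y)):
        labels, counts = np.unique(y[nn[i]], return_counts=True)
        predicted = labels[np.argmax(counts)]  # ties go to the smallest label
        errors += int(predicted != y[i])
    return errors / len(y)
```

Calling `loo_knn_error(X_view, labels)` once per view-specific embedding and once on the average gives numbers comparable in spirit to the small and large panels of Figure 7.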

Poses of airplanes dataset 

Details of annotations and view generation Each of the 200 airplanes was annotated with 16 landmarks, namely:

01. Top_Rudder
02. Bot_Rudder
03. L_Stabilizer
04. R_Stabilizer
05. L_WingTip
06. R_WingTip
07. NoseTip
08. Nose_Top
09. Nose_Bottom
10. Left_Wing_Base
11. Right_Wing_Base
12. Left_Engine_Front
13. Left_Engine_Back
14. Right_Engine_Front
15. Right_Engine_Back
16. Bot_Rudder_Front


This is also illustrated in Figure 8. The five different views are defined by considering different subsets of landmarks as follows:

1. all ∈ {1, 2, ..., 20}

2. back ∈ {1, 2, 3, 4, 16}

3. nose ∈ {7, 8, 9}

4. back+wings ∈ {1, 2, ..., 6, 10, 11, ..., 16}

5. nose+wings ∈ {5, 6, ..., 15}

For a triplet (A, B, C), we compute the similarities s_i(A, B) and s_i(A, C) by aligning the subset i of landmarks of B and C to A under a translation and scaling that minimizes the sum of squared errors after alignment. The similarity is inversely proportional to the residual error. This is also known as “Procrustes analysis”, commonly used for matching shapes.
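The alignment step can be sketched as follows, assuming each airplane is stored as a 16 x 2 array of landmark coordinates. The function names and the 0-based index convention are illustrative assumptions. Because the similarity is inversely proportional to the residual, comparing residuals is enough to decide the triplet; the least-squares scale is left unconstrained here, which is a simplification.

```python
import numpy as np

def alignment_residual(A, B):
    """Sum of squared errors after aligning B to A by the best scale and translation.

    Solves min over (s, t) of ||s * B + t - A||^2 in closed form:
    center both shapes, then s = <Ac, Bc> / <Bc, Bc>.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    Ac = A - A.mean(axis=0)
    Bc = B - B.mean(axis=0)
    denom = np.sum(Bc * Bc)
    scale = np.sum(Ac * Bc) / denom if denom > 0 else 0.0
    return float(np.sum((Ac - scale * Bc) ** 2))

def more_similar_to_B(A, B, C, subset):
    """True if, on the given landmark subset, B has a smaller residual to A than C."""
    idx = np.asarray(subset)
    A, B, C = (np.asarray(P, dtype=float) for P in (A, B, C))
    return alignment_residual(A[idx], B[idx]) < alignment_residual(A[idx], C[idx])
```

For example, the "back" view listed above would correspond to `subset = [0, 1, 2, 3, 15]` after converting the 1-based landmark numbers to 0-based indices.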

In addition to the results in the main paper, we illustrate the view-specific triplet generalization error and the leave-one-out 3-nearest-neighbour classification error in Figure 9.

Learned embedding Figure 10 shows a 2D projection of the global view of the objects onto their first two principal dimensions. The visualization shows that the objects roughly lie on a circle corresponding to the left-right and up-down orientation.

CUB-200 birds dataset 

Here, we also include the view-specific generalization errors and leave-one-out classification errors for the CUB-200 birds dataset. See Figure 11.

Public figures face dataset 

Description The Public Figures Face Database was created by Kumar et al. (Kumar et al. 2009). It consists of 58,797 images of 200 people. Every image is characterized by 75 real-valued attributes that describe the appearance of the person in the image. We selected 39 of the attributes and categorized them into 5 groups according to the aspects they describe: hair, age, accessory, shape and ethnicity. We randomly selected ten people and drew 20 images for each of them to create a dataset with 200 images. The similarity between instances for a given group is equal to the dot product between their attribute vectors, where the attributes are restricted to those in the group. We describe the details of these attributes below. Each group is considered as a local view, and the identities of the people in the images are considered as class labels.
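The group-restricted similarity can be illustrated with a short NumPy sketch. It assumes the attributes are stored as an n x 39 array `X` and that each group is a list of column indices; the function names, the column layout, and the triplet sampling procedure (uniform sampling of three distinct images, skipping ties) are hypothetical choices for illustration.

```python
import numpy as np

def group_similarity(X, cols):
    """Pairwise dot products using only the attribute columns in `cols`."""
    Z = np.asarray(X, dtype=float)[:, cols]
    return Z @ Z.T

def sample_triplets(S, n_triplets, rng):
    """Draw triplets (a, b, c) ordered so that a is more similar to b than to c."""
    n = S.shape[0]
    triplets = []
    while len(triplets) < n_triplets:
        a, b, c = rng.choice(n, size=3, replace=False)
        if S[a, b] == S[a, c]:
            continue  # skip ties, which carry no ordering information
        triplets.append((a, b, c) if S[a, b] > S[a, c] else (a, c, b))
    return np.array(triplets)
```

Applying `group_similarity` once per attribute group yields one ground-truth similarity matrix per local view, from which view-specific triplets can be drawn.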

Attributes Each image in the Public Figures Face Dataset (Pubfig)¹ is characterized by 75 attributes. We used 39 of the attributes in our work and categorized them into 5 groups according to the aspects they describe. Here is a table of the categories and attributes:


Category    Attributes
Hair        Black Hair, Blond Hair, Brown Hair, Gray Hair, Bald, Curly Hair, Wavy Hair, Straight Hair, Receding Hairline, Bangs, Sideburns.
Age         Baby, Child, Youth, Middle Aged, Senior.
Accessory   No Eyewear, Eyeglasses, Sunglasses, Wearing Hat, Wearing Lipstick, Heavy Makeup, Wearing Earrings, Wearing Necktie, Wearing Necklace.
Shape       Oval Face, Round Face, Square Face, High Cheekbones, Big Nose, Pointy Nose, Round Jaw, Narrow Eyes, Big Lips, Strong Nose-Mouth Lines.
Ethnicity   Asian, Black, White, Indian.

Table 2: List of Pubfig attributes that were used in our work.


Results The 200 images are embedded into 5, 10, and 20 dimensional spaces. We draw triplets randomly from the ground 
truth similarity measure to form training and test sets. Triplet generalization errors and classification errors are shown in Fig. 12. 
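As an illustration of the evaluation metric, the sketch below computes a triplet generalization error as the fraction of held-out triplets (a, b, c) that an embedding violates, that is, for which a is not strictly closer to b than to c. The function name and the use of squared Euclidean distance in the embedded space are assumptions for illustration.

```python
import numpy as np

def triplet_error(X, triplets):
    """Fraction of triplets (a, b, c) for which a is NOT closer to b than to c in X."""
    X = np.asarray(X, dtype=float)
    T = np.asarray(triplets)
    d_ab = np.sum((X[T[:, 0]] - X[T[:, 1]]) ** 2, axis=1)
    d_ac = np.sum((X[T[:, 0]] - X[T[:, 2]]) ** 2, axis=1)
    return float(np.mean(d_ab >= d_ac))
```

Evaluating this on test triplets for each view-specific embedding, and then averaging over views, mirrors the small and large panels described in the figure captions.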

In terms of the triplet generalization error, joint learning reduces the error faster than independent learning up to around 10,000 triplets, where the decrease slows down. Since the error in this regime reduces monotonically with an increasing number of dimensions, this can be understood as a bias induced by the joint learning. On the other hand, when we have fewer than 10,000 triplets, the error of joint learning increases (but not as much as that of independent learning) as the dimension increases; this can be understood as a variance effect. When embedding in a 20 dimensional space, joint learning has lower or comparable error to independent learning even when 10^ triplets are available. In terms of the leave-one-out classification error, joint learning continues to be better even when the number of triplets is very large.

Learning a new view 

Figure 13 shows a 2D projection of the embeddings learned by the independent approach and the proposed joint approach in the setting for the CUB-200 birds dataset described in the “learning a new view” part of the main text. Clearly, the proposed joint learning approach obtains better-separated clusters than the independent approach.

Relating the performance gain with the triplet consistency 


¹Available at http://www.cs.columbia.edu/CAVE/databases/pubfig/




Table 3: Relating the performance gain of joint and pooled learning with the between-task similarity.

                                          CUB-200   PubFig   Synthetic (uniform)   Synthetic (clustered)   Airplanes
Average triplet consistency               0.53      0.59     0.6                   0.69                    0.85
Performance gain of joint learning (%)    -4.6      6.5      26                    44                      35
Performance gain of pooled learning (%)   -8.0      -56      -40                   -29                     23



Figure 8: Landmarks illustrated on several planes.

Figure 9: Experimental results on the poses of planes dataset. The small figures show errors on individual views and the large figures show the average. (Left) Triplet generalization errors on the poses of planes dataset. (Right) Leave-one-out 3-nearest-neighbour classification error.

Figure 10: The global view of embeddings of poses of planes.



Figure 11: Results on the CUB-200 birds dataset. The small figures show errors on individual views and the large figures show the average. (a) Triplet generalization error. (b) Leave-one-out 3-nearest-neighbor classification error.

Figure 12: Results on the public figures faces dataset. Embeddings are learned in a 5 dimensional space, a 10 dimensional space and a 20 dimensional space. (a) Triplet generalization error. (b) Leave-one-out 3-nearest-neighbor classification error. The small figures show errors on individual views and the large figures show the average.

Figure 13: Learning a new view on the CUB-200 birds dataset. Training data contains 100 triplets from the second local view and 3,000 triplets from the other 5 views. Embeddings are learned in a 10 dimensional space and then further embedded in a 2 dimensional plane by using tSNE (van der Maaten and Hinton 2008) for the purpose of visualization. Left: triplet generalization error on the second local view. Middle: embedding learned independently. Right: embedding learned jointly.














